Abstract

The genetic code is considered to be universal. In order to test if some statistical 
properties of the coding bacterial genome were due to inherent properties of the 
genetic code, we compared the autocorrelation function, the scaling properties and 
the maximum entropy of the distribution of distances of amino acids in sequences 
obtained by translating protein-coding regions from the genome of Borrelia
burgdorferi under different genetic codes. Overall, our results indicate that
these properties are very stable to perturbations made by altering the genetic
code. We also discuss the likely evolutionary implications of the present results.
1 Introduction



Organisms use the genetic code to translate the information stored in DNA or
RNA nucleotide sequences into amino acid sequences called proteins.
The same code is used in nearly all living organisms (with exceptions such as
the mitochondrial genome), so it is considered nearly universal [1].

The universality of the genetic code suggests that it must have been
established early in evolution, and that once it appeared in nature it was
"frozen" [2].
An unanswered question is whether other genetic codes could accomplish the 
same function or be as efficient as the actual universal genetic code. 

In order to test the properties of different genetic codes, we analyzed some dis- 
criminating statistics of the distance distribution of amino acids (aa) derived 
from protein-coding regions from the genome of Borrelia burgdorferi, through 
numerical experiments in which the actual genetic code was perturbed. The
chosen statistics are related to the information content, the scaling
properties of the distance series, and the autocorrelation properties.



2 Experimental design 

We started with a sequence of protein-coding regions of the genome of Borrelia
burgdorferi (see [3]). This sequence was translated into a sequence of aa
using different genetic codes (see below). The three stop codons were assigned
the character X. Then, for each character (either an aa or a stop codon), we
generated the series of distances between identical characters along the
chromosome. Our master control was the universal genetic code itself. As
negative controls, we considered both a shuffled version of the original
sequence (shuffled code) and a synthetic DNA sequence 10^6 nucleotides long,
obtained by sampling the four DNA nucleotides with replacement (random code).

Instead of the classic random walk mapping of a DNA sequence [4,5], we
followed a different approach for studying the statistical properties of aa
sequences. In particular, for a given aa, we determined its positions along
the whole sequence, and from these we measured, as distance, the number of
aa that lie between two consecutive identical characters. Hence, we obtained
the actual distance series for each character.
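
The distance measure described above can be illustrated with a minimal Python
sketch. The codon table below is a small hypothetical fragment chosen only for
the example (a real run would use the full 64-codon table of the code under
test), and the function names are assumptions, not part of the original
analysis. Adjacent identical characters are given distance 0, following the
definition "number of aa that lie between".

```python
from collections import defaultdict

# Hypothetical fragment of a codon table (DNA alphabet); 'X' marks a stop codon.
TOY_CODE = {
    "ATG": "M", "GAT": "D", "GAC": "D", "AAA": "K",
    "GGT": "G", "TAA": "X",
}


def translate(dna, code):
    """Translate in-frame codons into one-letter characters using `code`."""
    usable = len(dna) - len(dna) % 3
    return "".join(code[dna[i:i + 3]] for i in range(0, usable, 3))


def distance_series(chars):
    """Map each character to the gaps between its consecutive occurrences.

    A gap is the number of characters lying strictly between two occurrences.
    Characters that occur only once produce no series.
    """
    last_seen = {}
    series = defaultdict(list)
    for pos, ch in enumerate(chars):
        if ch in last_seen:
            series[ch].append(pos - last_seen[ch] - 1)
        last_seen[ch] = pos
    return dict(series)


protein = translate("ATGGATAAAGACGGTGATTAA", TOY_CODE)
distances = distance_series(protein)
```

Each entry of `distances` is one distance series; the statistics of Section 2.2
would then be computed on each series separately.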



2.1 Genetic codes 

The universal code presents a characteristic distribution of codons to aa. In
this distribution, there are several aa which are encoded by more than one
codon, so the code is degenerate [1]. Often the base in the third position is
less significant, as a mutation in this position does not imply a change in
the encoded aa (third-base degeneracy).
The first perturbation was called code 2. Amino acids were randomly assigned
to codons, preserving the universal distribution of codons to aa, i.e. the
degeneracy of the code is the same as that of the universal code. In this way,
a mutation in the third base can alter the coded aa, so the third-base
degeneracy is no longer sustained.

In the uniform code we assumed a uniform distribution of codons to aa, so 
each aa is coded by three different randomly chosen codons. As there are 20 
aa, the uniform code has four stop codons. 

To generate what we called the crazy code, we built a population in which
the 21 characters (representing either an aa or a stop codon) were sampled
with replacement, and each of them was randomly assigned to one of the
64 codons, so the distribution of codons to aa is neither universal-like nor
uniform, e.g. three different aa can be translated with five different codons.
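
The three randomized codes described above can be sketched in Python as
follows. This is one possible reading of the descriptions, not the authors'
implementation: code 2 is taken to be a shuffle of the character labels over
the 64 codons (so each character keeps its number of codons), and the crazy
code draws one of the 21 characters independently for every codon. The
function names and the fixed seed are assumptions made for the example.

```python
import random
from collections import Counter
from itertools import product

BASES = "TCAG"
# Standard genetic code in TCAG codon order; 'X' stands for a stop codon.
AA = "FFLLSSSSYYXXCCXWLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODONS = ["".join(c) for c in product(BASES, repeat=3)]
UNIVERSAL = dict(zip(CODONS, AA))


def code2(rng):
    """Shuffle labels over codons: same degeneracy per character, but
    synonymous codons are no longer required to share their first two bases."""
    labels = list(AA)
    rng.shuffle(labels)
    return dict(zip(CODONS, labels))


def uniform_code(rng):
    """Three randomly chosen codons per amino acid; the 4 left over are stops."""
    amino_acids = sorted(set(AA) - {"X"})
    labels = [aa for aa in amino_acids for _ in range(3)] + ["X"] * 4
    rng.shuffle(labels)
    return dict(zip(CODONS, labels))


def crazy_code(rng):
    """Draw a character for every codon, with replacement, from the 21 symbols.
    Some characters may end up with no codon at all."""
    symbols = sorted(set(AA))
    return {codon: rng.choice(symbols) for codon in CODONS}


rng = random.Random(0)  # fixed seed only to make the example repeatable
c2, uni, crazy = code2(rng), uniform_code(rng), crazy_code(rng)
```

Comparing `Counter(c2.values())` with `Counter(AA)` confirms that code 2 keeps
the universal degeneracy, while `Counter(uni.values())` shows three codons per
amino acid and four stop codons.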

Finally, we also tested a perturbation following the RNA world hypothesis [6] 
using the RNY (purine-any nucleotide-pyrimidine) pattern proposed by Eigen 
& Schuster [7]. In the RNA world code aspartic acid is coded by GAC and 
GAU; asparagine is coded by AAC and AAU; alanine is coded by GCU and 
GCC; isoleucine is coded by AUU and AUC; glycine is coded by GGC and 
GGU; serine is coded by AGC and AGU; threonine is coded by ACU and 
ACC; and valine is coded by GUU and GUC. This code has been proposed
as the primeval genetic code (see also [8]). 
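
The codon assignments listed above can be written as a small lookup table and
checked against the RNY pattern. The sketch below is illustrative only; the
dictionary and function names are assumptions, and it does not address how
codons outside the RNY set were handled in the numerical experiments.

```python
# RNA world code as listed in the text (RNA alphabet, one-letter aa symbols).
RNA_WORLD_CODE = {
    "GAC": "D", "GAU": "D",  # aspartic acid
    "AAC": "N", "AAU": "N",  # asparagine
    "GCU": "A", "GCC": "A",  # alanine
    "AUU": "I", "AUC": "I",  # isoleucine
    "GGC": "G", "GGU": "G",  # glycine
    "AGC": "S", "AGU": "S",  # serine
    "ACU": "T", "ACC": "T",  # threonine
    "GUU": "V", "GUC": "V",  # valine
}

PURINES = set("AG")
PYRIMIDINES = set("CU")


def is_rny(codon):
    """True if the codon is purine - any nucleotide - pyrimidine."""
    return codon[0] in PURINES and codon[2] in PYRIMIDINES


all_rny = all(is_rny(codon) for codon in RNA_WORLD_CODE)
n_codons = len(RNA_WORLD_CODE)
n_amino_acids = len(set(RNA_WORLD_CODE.values()))
```

The 16 listed codons are exactly the 2 x 4 x 2 combinations allowed by the RNY
pattern, and they encode 8 amino acids with two codons each.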

2.2 Statistical analysis 

In order to test for statistical differences among the master, the negative and
the synthetic codes (code 2, uniform code, crazy code and RNA world code),
both the p-value of the Wilcoxon-Mann-Whitney test [9] and the bootstrap
[10] 95% confidence intervals (C.I. 95) were calculated for three different
statistics. The first technique tests for differences in location between two
samples regardless of the particular distribution of the random variable, and
the latter estimates non-parametric confidence intervals by resampling the
data many times (e.g. 1000 times) with replacement. The bootstrap C.I. 95 from
the random code gives an estimate of the white-noise bandwidth.
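
Both procedures can be sketched with the Python standard library. The code
below is a minimal illustration under stated assumptions, not the procedure
used to produce Table 1: the bootstrap uses the percentile method on the mean,
the Wilcoxon-Mann-Whitney p-value uses the normal approximation without a tie
correction of the variance, and the two input samples are made-up numbers
standing in for the 21 per-character values of a statistic.

```python
import math
import random
from statistics import mean


def bootstrap_ci(values, n_resamples=1000, level=0.95, seed=0):
    """Percentile bootstrap confidence interval for the mean of `values`."""
    rng = random.Random(seed)
    n = len(values)
    means = sorted(mean(rng.choices(values, k=n)) for _ in range(n_resamples))
    lo = means[int((1 - level) / 2 * n_resamples)]
    hi = means[int((1 + level) / 2 * n_resamples) - 1]
    return lo, hi


def mann_whitney_u(a, b):
    """Return (U of sample a, two-sided p-value) by normal approximation.

    Tied values receive average ranks; the variance is not tie-corrected.
    """
    pooled = sorted((v, g) for g, grp in enumerate((a, b)) for v in grp)
    rank_sum_a = 0.0
    i = 0
    while i < len(pooled):
        j = i
        while j < len(pooled) and pooled[j][0] == pooled[i][0]:
            j += 1
        avg_rank = (i + 1 + j) / 2  # mean of the ranks i+1 .. j
        rank_sum_a += avg_rank * sum(1 for k in range(i, j) if pooled[k][1] == 0)
        i = j
    n1, n2 = len(a), len(b)
    u = rank_sum_a - n1 * (n1 + 1) / 2
    mu = n1 * n2 / 2
    sigma = math.sqrt(n1 * n2 * (n1 + n2 + 1) / 12)
    z = (u - mu) / sigma
    return u, math.erfc(abs(z) / math.sqrt(2))


# Made-up example data: 21 values per "code", clearly separated.
sample_a = [float(v) for v in range(1, 22)]
sample_b = [v + 100.0 for v in sample_a]
ci_low, ci_high = bootstrap_ci(sample_a)
u_stat, p_value = mann_whitney_u(sample_a, sample_b)
```

With only 21 values per group an exact test may be preferable to the normal
approximation; the approximation is used here only to keep the sketch short.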

The chosen statistics were: a) the average of the mean of the first 38 lags of
the autocorrelation function (ACF); b) the mean of the detrended fluctuation
analysis (DFA) scaling exponent [11]; c) the average of the geometric variation
coefficient of the maximum entropy (ME-gvc), where gvc is the ratio
sd(x)/gm(x), sd(x) is the standard deviation of the maximum entropy of the
series x, and gm(x) is the geometric mean of the maximum entropy of the series
x. The maximum entropy of the series was calculated with the SSA-MTM Toolkit,
v4.2 [12]. All means were calculated for the 21 characters within each code.
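
Statistics a) and c) can be sketched as follows. This is an illustration under
assumptions: the ACF uses the usual biased sample estimator, the gvc uses the
sample standard deviation, and the maximum entropy values themselves (obtained
in the paper with the SSA-MTM Toolkit) are not computed here, so `gvc` simply
takes a list of positive numbers. The function names are hypothetical.

```python
import math
from statistics import mean, stdev


def acf(series, max_lag):
    """Sample autocorrelation of `series` at lags 1 .. max_lag."""
    n = len(series)
    m = mean(series)
    dev = [v - m for v in series]
    denom = sum(d * d for d in dev)
    return [
        sum(dev[t] * dev[t + k] for t in range(n - k)) / denom
        for k in range(1, max_lag + 1)
    ]


def mean_acf(series, max_lag=38):
    """Mean of the first `max_lag` autocorrelation values of one series."""
    return mean(acf(series, max_lag))


def gvc(values):
    """Geometric variation coefficient: standard deviation divided by the
    geometric mean. All values must be positive."""
    gm = math.exp(mean(math.log(v) for v in values))
    return stdev(values) / gm


# A strictly alternating series is strongly anticorrelated at lag 1.
alternating = [0.0, 1.0] * 50
lag1, lag2 = acf(alternating, 2)
summary = mean_acf(alternating)
```

Averaging `mean_acf` over the 21 distance series of a code would give the
per-code value of statistic a).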

Both the ACF and the DFA look for autocorrelations within the series. The
DFA technique is based on a modified root mean square analysis of a random
walk, to assess the intrinsic correlation properties of a dynamic system
separated from external trends in the data, and is intended to determine the
scaling properties of a time series [11]. A DFA scaling exponent equal to 0.5
is indicative of white noise; if the value lies between 0.5 and 1, the time
series exhibits long-range correlations.
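
A minimal first-order DFA can be sketched in pure Python to make the procedure
concrete. This is a simplified illustration, not the implementation of [11]
used in the paper: the window sizes, the non-overlapping windows, the linear
detrending, and the function names are all assumptions made for the example.

```python
import math
import random


def linear_fit(xs, ys):
    """Least-squares slope and intercept of ys against xs."""
    xs = list(xs)
    mx = sum(xs) / len(xs)
    my = sum(ys) / len(ys)
    sxx = sum((x - mx) ** 2 for x in xs)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, my - slope * mx


def dfa_exponent(series, window_sizes=(8, 16, 32, 64, 128)):
    """Scaling exponent: slope of log F(n) against log n."""
    m = sum(series) / len(series)
    profile, total = [], 0.0
    for v in series:  # integrate the mean-subtracted series
        total += v - m
        profile.append(total)

    log_n, log_f = [], []
    for n in window_sizes:
        n_windows = len(profile) // n
        sq_sum = 0.0
        for w in range(n_windows):
            seg = profile[w * n:(w + 1) * n]
            slope, intercept = linear_fit(range(n), seg)  # local linear trend
            sq_sum += sum(
                (y - (slope * x + intercept)) ** 2 for x, y in enumerate(seg)
            )
        log_n.append(math.log(n))
        log_f.append(math.log(math.sqrt(sq_sum / (n_windows * n))))
    return linear_fit(log_n, log_f)[0]


rng = random.Random(1)  # fixed seed only to make the example repeatable
white = [rng.gauss(0.0, 1.0) for _ in range(4096)]
walk, position = [], 0.0
for step in white:  # random walk built from the same noise
    position += step
    walk.append(position)

alpha_white = dfa_exponent(white)
alpha_walk = dfa_exponent(walk)
```

For uncorrelated noise the estimate should fall near 0.5, whereas the random
walk built from it should give a clearly larger exponent.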



3 Results 



The distance distributions for each particular aa changed for all the tested 
codes. Fig. 1 shows the probability density function (pdf) of the distance 
distribution for aspartic acid as an example. Interestingly, the pdfs of various 
aa obtained with different codes presented an oscillatory decaying pattern. 
The only exception to this pattern was observed in the RNA world code, in 
which none of the aa presented oscillations. 
Fig. 1. Probability density functions of the distance distribution of aspartic acid.
Distance is measured as the number of aa that are between two identical aa. a)
Universal code; b) Random code; c) Code 2; d) Uniform code; e) Crazy code; f)
RNA world code.

Several works have reported periodic patterns in DNA sequences by means
of autocorrelation function (ACF) analysis [13,14,15,16,17]. We also used
the ACF, but applied it to the aa distance series obtained with both the
universal code and the negative controls (shuffled and random codes). We found
autocorrelations with the universal code, and no autocorrelation at any lag in
the negative controls (as indicated by the bandwidth of white noise). As can
be seen from Table 1, there is a clear difference, as measured by the
bootstrap C.I. 95, between the universal code and the negative controls.
Further, we tested whether the average of the mean of the first 38 lags of the
ACF of the universal code was statistically different from that of each of the
other codes. Using a non-parametric test, we found differences with code 2,
crazy code, and uniform code. The distributions for these synthetic codes were
shifted to lower values with respect to the universal code distribution,
although with a slight overlap (see Table 1).

Table 1

Bootstrap confidence intervals (C.I.) for: autocorrelation function analysis (ACF);
detrended fluctuation analysis scaling exponent (DFA); and geometric variation
coefficient of maximum entropy (ME-gvc).

Code         C.I. (ACF)           C.I. (DFA)      C.I. (ME-gvc)
Universal    0.09 - 0.12          0.72 - 0.75     0.87 - 1.14
Shuffled     0.03 - 0.06*         0.56 - 0.59*    0.23 - 0.46*
Random       -1.5e-3 - 1.26-7*    0.49 - 0.51*    0.07 - 0.09*
Code 2       0.07 - 0.09*         0.69 - 0.73*    0.61 - 0.84*
Crazy        0.06 - 0.10*         0.69 - 0.73*    0.62 - 0.90*
Uniform      0.07 - 0.10*         0.71 - 0.74     0.67 - 0.89*
RNA World    0.06 - 0.13          0.65 - 0.73*    0.51 - 1.11

*p < 0.05, *p < 0.01, *p < 0.001 (Wilcoxon-Mann-Whitney test)



Several authors have reported long-range correlations in DNA [5,18,19,20,21].
Here, in order to look for long-range correlations in the distance series of
each aa, we calculated the DFA scaling exponent [11] for each distance series,
and then tested for differences in the mean value for each code against the
corresponding value obtained with the universal code. We found statistical
differences with code 2, crazy code and the RNA world code, although with
slight overlaps in the bootstrap C.I. 95 in all cases (see Table 1). As
expected, the DFA scaling exponents of the negative controls lie within
(random) or very close to (shuffled) the values that indicate white noise. It
is worth mentioning that both negative controls are strikingly different from
all other codes.

Entropy, as a measure of information, has also been used to analyze DNA
and to make comparisons between coding and non-coding regions [22,23,24].
In the current study, we calculated the average of ME-gvc for each distance
series. Again there was a clear difference between the statistics of the
negative controls and those of all the other codes. Statistical differences
from the universal code were also found for code 2, crazy code and uniform
code, with a slight overlap in the bootstrap C.I. 95 (see Table 1).



4 Discussion

There have been several papers, some of them considered as classics, which 
have contributed to our understanding of the origin of the genetic code [6,7,2,25,26,27,8]. 
However, none of them addressed the statistical properties of the translated 
sequence of aa. Here we carried out numerical experiments with different
genetic codes in which some statistical properties of the translated products
were analyzed. In order to study the coding DNA, we based our analysis on
sequences of aa obtained by translating the protein-coding sequence from the
Borrelia burgdorferi genome.

Peng et al. [5] found that noncoding DNA sequences show long-range
autocorrelations whereas coding sequences do not. This is remarkable in the
case of bacterial chromosomes, since most of their DNA content is coding.
Indeed, they showed that in bacteria there is a lack of autocorrelation. We
found autocorrelation both in DNA coding sequences [3] and in aa sequences
obtained by translating bacterial coding DNA. These apparently contradictory
results are presumably due to differences in the experimental design, as we
analyzed the distance series between characters (either aa or n-tuples of DNA).

In general, the bootstrap C.I. 95 for all the tested statistics of the
synthetic codes showed reduced information content, weaker long-range
correlations, and smaller values of the scaling exponent when compared with
the master code. Thus, the universal code seems to yield optimum values for
those statistics.

Despite the statistical differences found with the alternative codes, in all
cases the statistics have values closer to those of the universal code than to
those of the negative controls. This suggests that the genetic code is very
robust to perturbations, as information measures, as well as long- and
short-range correlations, are maintained. Thus, once the universal code was
established, it became fixed and resistant to evolutionary changes. What makes
the universal code unique remains an open question.